Results 1 - 20 of 36
1.
Sci Rep ; 14(1): 9516, 2024 04 25.
Article in English | MEDLINE | ID: mdl-38664448

ABSTRACT

Recent technologies such as spatial transcriptomics enable the measurement of gene expression at the single-cell level along with the spatial locations of these cells in the tissue. Spatial clustering of the cells provides valuable insights into the functional organization of the tissue. However, most such clustering methods involve some dimension reduction that leads to a loss of the inherent dependency structure among genes at any spatial location in the tissue. This discards valuable insight into gene co-expression patterns and may also degrade spatial clustering performance. In spatial transcriptomics, the matrix-variate gene expression data, along with the spatial coordinates of the single cells, provide information on both gene expression dependencies and cell spatial dependencies through the row and column covariances. In this work, we propose a joint Bayesian approach to simultaneously estimate these gene and spatial cell correlations. These estimates provide data summaries for downstream analyses. We illustrate our method with simulations and analyses of several real spatial transcriptomic datasets. Our work elucidates gene co-expression networks as well as clear spatial clustering patterns of the cells. Furthermore, our analysis reveals that downstream spatial-differential analysis may aid in the discovery of unknown cell types from known marker genes.
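The row-and-column covariance structure invoked above is that of the matrix-normal distribution. As a point of reference (not the paper's joint Bayesian estimator), the classical "flip-flop" algorithm computes maximum likelihood estimates of the two covariance factors by alternation. The sketch below, with hypothetical dimensions and simulated data, is a minimal numpy illustration of that structure:

```python
import numpy as np

def flip_flop(X, n_iter=50):
    # alternating ("flip-flop") MLE of the row covariance U (p x p) and
    # column covariance V (q x q) for samples X[k] ~ matrix-normal(0, U, V)
    N, p, q = X.shape
    U, V = np.eye(p), np.eye(q)
    for _ in range(n_iter):
        Vinv = np.linalg.inv(V)
        U = sum(Xk @ Vinv @ Xk.T for Xk in X) / (N * q)
        Uinv = np.linalg.inv(U)
        V = sum(Xk.T @ Uinv @ Xk for Xk in X) / (N * p)
    V = V / V[0, 0]                # fix the scale indeterminacy between U and V
    return U, V

rng = np.random.default_rng(0)
p, q, N = 4, 5, 200                # e.g. genes x locations, N replicate matrices
A = rng.standard_normal((p, p)); U_true = A @ A.T + p * np.eye(p)
B = rng.standard_normal((q, q)); V_true = B @ B.T + q * np.eye(q)
Lu, Lv = np.linalg.cholesky(U_true), np.linalg.cholesky(V_true)
X = np.stack([Lu @ rng.standard_normal((p, q)) @ Lv.T for _ in range(N)])
U_hat, V_hat = flip_flop(X)
```

Only the Kronecker product U ⊗ V is identified, hence the normalization of V at the end.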


Subject(s)
Bayes Theorem, Gene Expression Profiling, Transcriptome, Gene Expression Profiling/methods, Cluster Analysis, Humans, Single-Cell Analysis/methods, Gene Regulatory Networks, Algorithms, Computer Simulation
2.
Biometrics ; 80(1), 2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38364805

ABSTRACT

Survival models are used to analyze time-to-event data in a variety of disciplines. Proportional hazards models provide interpretable parameter estimates, but the proportional hazards assumption is not always appropriate. Non-parametric models are more flexible but often lack a clear inferential framework. We propose a Bayesian treed hazards partition model that is both flexible and inferential. Inference is obtained through the posterior tree structure, and flexibility is preserved by modeling the log-hazard function in each partition using a latent Gaussian process. An efficient reversible jump Markov chain Monte Carlo algorithm is obtained by marginalizing the parameters in each partition element via a Laplace approximation. Consistency properties for the estimator are established. The method can be used to help determine subgroups as well as prognostic and/or predictive biomarkers in time-to-event data. The method is compared with some existing methods on simulated data and a liver cirrhosis dataset.


Subject(s)
Algorithms, Proportional Hazards Models, Bayes Theorem, Markov Chains, Monte Carlo Method
3.
Genet Epidemiol ; 47(1): 95-104, 2023 02.
Article in English | MEDLINE | ID: mdl-36378773

ABSTRACT

The clustering of proteins is of interest in cancer cell biology. This article proposes a hierarchical Bayesian model for protein (variable) clustering hinging on correlation structure. Starting from a multivariate normal likelihood, we enforce the clustering through prior modeling using an angle-based unconstrained reparameterization of correlations, and we assume a truncated Poisson distribution (to penalize a large number of clusters) as the prior on the number of clusters. The posterior distributions of the parameters are not available in explicit form, and a reversible jump Markov chain Monte Carlo-based technique is used to simulate the parameters from the posteriors. The end products of the proposed method are the estimated cluster configuration of the proteins (variables) along with the number of clusters. The Bayesian method is flexible enough to cluster the proteins as well as estimate the number of clusters. The performance of the proposed method has been substantiated with extensive simulation studies and a protein expression dataset on hereditary breast cancer in which the proteins come from different pathways.
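The angle-based unconstrained reparameterization mentioned above can be sketched with a generic construction: write each row of the Cholesky factor of the correlation matrix in hyperspherical coordinates, so that any array of angles in (0, π) maps to a valid correlation matrix. This is a standard construction consistent with the abstract's description, not the authors' code; names and settings are illustrative.

```python
import numpy as np

def angles_to_corr(theta):
    # build a lower-triangular matrix B whose rows have unit norm by
    # writing them in hyperspherical coordinates; R = B B' is then a
    # valid correlation matrix for any angles in (0, pi)
    p = theta.shape[0]
    B = np.zeros((p, p))
    B[0, 0] = 1.0
    for i in range(1, p):
        s = 1.0
        for j in range(i):
            B[i, j] = np.cos(theta[i, j]) * s
            s *= np.sin(theta[i, j])
        B[i, i] = s
    return B @ B.T

rng = np.random.default_rng(1)
p = 5
theta = rng.uniform(0.1, np.pi - 0.1, size=(p, p))   # only entries below the diagonal are used
R = angles_to_corr(theta)
```

Because each row of B has unit norm and positive diagonal, R automatically has unit diagonal and is positive definite, which is what makes the angles an unconstrained parameterization.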


Subject(s)
Breast Neoplasms, Humans, Female, Bayes Theorem, Breast Neoplasms/genetics, Genetic Models, Cluster Analysis, Markov Chains, Monte Carlo Method
4.
J Am Stat Assoc ; 116(535): 1075-1087, 2021.
Article in English | MEDLINE | ID: mdl-34898760

ABSTRACT

Estimating the marginal and joint densities of the long-term average intakes of different dietary components is an important problem in nutritional epidemiology. Since these variables cannot be directly measured, data are usually collected in the form of 24-hour recalls of the intakes, which show marked patterns of conditional heteroscedasticity. Significantly compounding the challenges, the recalls for episodically consumed dietary components also include exact zeros. The problem of estimating the density of the latent long-term intakes from their observed measurement-error-contaminated proxies is then a problem of deconvolution of densities with zero-inflated data. We propose a Bayesian semiparametric solution to the problem, building on a novel hierarchical latent variable framework that translates the problem to one involving continuous surrogates only. Crucial to accommodating important aspects of the problem, we then design a copula-based approach to model the involved joint distributions, adopting different modeling strategies for the marginals of the different dietary components. We design efficient Markov chain Monte Carlo algorithms for posterior inference and illustrate the efficacy of the proposed method through simulation experiments. Applied to our motivating nutritional epidemiology problems, compared to other approaches, our method provides more realistic estimates of the consumption patterns of episodically consumed dietary components.

5.
Bernoulli (Andover) ; 27(1): 637-672, 2021 Feb.
Article in English | MEDLINE | ID: mdl-34305432

ABSTRACT

Gaussian graphical models are a popular tool to learn the dependence structure, in the form of a graph, among variables of interest. Bayesian methods have gained popularity in the last two decades due to their ability to simultaneously learn the covariance and the graph. There is a wide variety of model-based methods to learn the underlying graph assuming various forms of the graphical structure. Although decomposability is commonly imposed on the graph space for scalability of the Markov chain Monte Carlo algorithms, its possible implications for the posterior distribution of the graph are not clear. An open problem in Bayesian decomposable structure learning is whether the posterior distribution is able to select a meaningful decomposable graph that is "close" to the true non-decomposable graph, when the dimension of the variables increases with the sample size. In this article, we explore specific conditions on the true precision matrix and the graph which result in an affirmative answer to this question with a commonly used hyper-inverse Wishart prior on the covariance matrix and a suitable complexity prior on the graph space. In the absence of structural sparsity assumptions, our strong selection consistency holds in a high-dimensional setting where p = O(n^α) for α < 1/3. We show that when the true graph is non-decomposable, the posterior distribution concentrates on a set of graphs that are minimal triangulations of the true graph.
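As background to the decomposability restriction studied above: decomposable graphs are exactly the chordal graphs, and minimal triangulations are minimal chordal supergraphs of the true graph. A standard way to check chordality (not taken from this paper) is maximum cardinality search: the graph is chordal iff the reverse MCS order is a perfect elimination ordering (Tarjan & Yannakakis, 1984). A small pure-Python sketch:

```python
from collections import defaultdict

def is_chordal(adj):
    # maximum cardinality search: repeatedly pick the unvisited vertex with
    # the most visited neighbours; the reverse pick order is a perfect
    # elimination ordering iff the graph is chordal
    picked, order = set(), []
    weight = {v: 0 for v in adj}
    while len(order) < len(adj):
        v = max((u for u in adj if u not in picked), key=lambda u: weight[u])
        order.append(v)
        picked.add(v)
        for u in adj[v]:
            if u not in picked:
                weight[u] += 1
    peo = order[::-1]
    pos = {v: i for i, v in enumerate(peo)}
    for v in peo:
        later = [u for u in adj[v] if pos[u] > pos[v]]
        # the later neighbours of each vertex must form a clique
        for i in range(len(later)):
            for j in range(i + 1, len(later)):
                if later[j] not in adj[later[i]]:
                    return False
    return True

def graph(edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)

c4 = graph([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])                 # 4-cycle: not chordal
c4_tri = graph([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]) # add a chord: chordal
```

Adding the chord ("a", "c") to the 4-cycle is precisely a (minimal) triangulation of that non-decomposable graph.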

6.
Adv Exp Med Biol ; 1332: 211-227, 2021.
Article in English | MEDLINE | ID: mdl-34251646

ABSTRACT

Measuring usual dietary intake in freely living humans is difficult to accomplish. As a part of our recent study, a food frequency questionnaire was completed by healthy adult men and women at days 0 and 90 of the study. Data from the food questionnaire were analyzed with a nutrient analysis program ( www.Harvardsffq.date ). Healthy men and women consumed protein as 19-20% and 17-19% of their total energy intakes, respectively, with animal protein representing about 75% and 70% of their total protein intakes, respectively. The intake of each nutritionally essential amino acid (EAA) by these persons exceeded that recommended for healthy adults with minimal physical activity. In all individuals, the dietary intake of leucine was the highest, followed by lysine, valine, and isoleucine in descending order, and the ingestion of amino acids that are synthesizable de novo in animal cells (AASAs) was about 20% greater than that of total EAAs. The intake of each AASA met the amounts recommended for healthy adults with minimal physical activity. However, intakes of some AASAs (alanine, arginine, aspartate, glutamate, and glycine) from a typical diet providing 90-110 g food protein/day do not meet the requirements of adults with intensive physical activity. Within the male or female group, there were no significant differences in the dietary intakes of any amino acid between days 0 and 90 of the study, and this was also true for nearly all other essential nutrients. Our findings will help to improve amino acid nutrition and health in both the general population and exercising individuals.


Subject(s)
Amino Acids, Diet, Adult, Eating, Energy Intake, Female, Humans, Male, Nutrients
7.
Chemometr Intell Lab Syst ; 212, 2021 May 15.
Article in English | MEDLINE | ID: mdl-35068632

ABSTRACT

BACKGROUND: The endogenous circadian clock, which controls daily rhythms in the expression of at least half of the mammalian genome, has a major influence on cell physiology. Consequently, disruption of the circadian system is associated with a wide range of diseases, including cancer. While several circadian clock genes have been associated with cancer progression, little is known about survival outcomes when two or more platforms are considered together. Our goal was to determine whether survival outcomes are associated with circadian clock function. To accomplish this goal, we developed a Bayesian hierarchical survival model coupled with a global-local shrinkage prior and applied this model to available RNA-Seq and Copy Number Variation data to select significant circadian genes associated with cancer progression. RESULTS: Using a Bayesian shrinkage approach with the Bayesian accelerated failure time (AFT) model, we showed that the circadian clock-associated gene DEC1 is positively correlated with survival outcome in breast cancer patients. The R package circgene implementing the methodology is available at https://github.com/MAITYA02/circgene. CONCLUSIONS: The proposed Bayesian hierarchical model is the first shrinkage-prior-based model of its kind that integrates two omics platforms to identify significant circadian genes for cancer survival.

8.
PLoS One ; 15(10): e0238996, 2020.
Article in English | MEDLINE | ID: mdl-33095785

ABSTRACT

Recent developments in high-throughput methods have resulted in the collection of high-dimensional data types from multiple sources and technologies that measure distinct yet complementary information. Integrated clustering of such multiple data types, or multi-view clustering, is critical for revealing pathological insights. However, multi-view clustering is challenging due to the complex dependence structure between multiple data types, including directional dependency. Specifically, genomics data types have pre-specified directional dependencies known as the central dogma, which describes the process of information flow from DNA to messenger RNA (mRNA) and then from mRNA to protein. Most of the existing multi-view clustering approaches assume an independent structure or pair-wise (non-directional) dependence between data types, thereby ignoring their directional relationship. Motivated by this, we propose a biology-inspired Bayesian integrated multi-view clustering model that uses an asymmetric copula to accommodate the directional dependencies between the data types. Via extensive simulation experiments, we demonstrate the negative impact of ignoring directional dependency on clustering performance. We also present an application of our model to a real-world dataset of breast cancer tumor samples collected from The Cancer Genome Atlas program and provide comparative results.


Subject(s)
Genomics/methods, Statistical Models, Bayes Theorem, Breast Neoplasms/genetics, Cluster Analysis, Computer Simulation, Statistical Data Interpretation, Genetic Databases/statistics & numerical data, Female, Genomics/statistics & numerical data, Humans, Markov Chains, Normal Distribution
9.
Biometrika ; 107(1): 205-221, 2020 Mar.
Article in English | MEDLINE | ID: mdl-33100350

ABSTRACT

We develop a Bayesian methodology aimed at simultaneously estimating low-rank and row-sparse matrices in a high-dimensional multiple-response linear regression model. We consider a carefully devised shrinkage prior on the matrix of regression coefficients which obviates the need to specify a prior on the rank, and shrinks the regression matrix towards low-rank and row-sparse structures. We provide theoretical support for the proposed methodology by proving minimax optimality of the posterior mean under the prediction risk in ultra-high-dimensional settings where the number of predictors can grow sub-exponentially relative to the sample size. A one-step post-processing scheme induced by group lasso penalties on the rows of the estimated coefficient matrix is proposed for variable selection, with default choices of tuning parameters. We additionally provide an estimate of the rank using a novel optimization function achieving dimension reduction in the covariate space. We exhibit the performance of the proposed methodology in an extensive simulation study and a real data example.
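The row-wise group-lasso post-processing can be illustrated with a generic soft-thresholding step: rows of the estimated coefficient matrix whose Euclidean norm falls below the tuning parameter are set exactly to zero, giving variable selection. This is a schematic sketch with hypothetical data and tuning value, not the paper's exact optimization:

```python
import numpy as np

def row_group_threshold(B, lam):
    # group-lasso style shrinkage on rows: scale each row by
    # max(0, 1 - lam / ||row||), so weak rows are zeroed out entirely
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return B * scale

rng = np.random.default_rng(2)
B = rng.standard_normal((10, 3))
B[5:] *= 0.01                      # rows 5..9 play the role of noise predictors
B_sel = row_group_threshold(B, lam=0.5)
selected = np.flatnonzero(np.linalg.norm(B_sel, axis=1) > 0)
```

Zeroing a whole row removes the corresponding predictor from every response at once, which is the point of grouping by rows.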

10.
Bioinformatics ; 36(13): 3951-3958, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32369552

ABSTRACT

MOTIVATION: It is well known that integrating different data sources is valuable because of its potential to unveil new functionalities of genomic expression that might remain dormant in a single-source analysis. Moreover, several studies have demonstrated the greater power of multi-platform analyses. Toward this, in this study, we consider the circadian genes' omics profiles, such as copy number changes and RNA-sequence data, along with their survival response. We develop a Bayesian structural equation model coupled with linear regressions and a log-normal accelerated failure-time regression to integrate the information between these two platforms to predict the survival of the subjects. We place conjugate priors on the regression parameters and derive the Gibbs sampler using their conditional distributions. RESULTS: Our extensive simulation study shows that the integrative model provides a better fit to the data than its closest competitor. The analyses of glioblastoma cancer data and breast cancer data from TCGA, the largest genomics and transcriptomics database, support our findings. AVAILABILITY AND IMPLEMENTATION: The developed method is wrapped in an R package available at https://github.com/MAITYA02/semmcmc. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
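The conjugate-prior Gibbs strategy described above can be sketched, for a single Gaussian linear regression component, as alternation between the normal full conditional of the coefficients and the inverse-gamma full conditional of the error variance. This is a textbook sketch under assumed priors (normal on b, inverse-gamma on s2), not the semmcmc implementation:

```python
import numpy as np

def gibbs_linreg(X, y, n_iter=2000, tau2=100.0, a0=2.0, b0=2.0, seed=0):
    # Gibbs sampler for y = X b + e, e ~ N(0, s2 I), with conjugate priors
    # b ~ N(0, tau2 I) and s2 ~ Inv-Gamma(a0, b0)
    rng = np.random.default_rng(seed)
    n, p = X.shape
    b, s2 = np.zeros(p), 1.0
    draws_b, draws_s2 = [], []
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(n_iter):
        # b | s2, y ~ N(m, V)
        V = np.linalg.inv(XtX / s2 + np.eye(p) / tau2)
        V = (V + V.T) / 2.0        # enforce symmetry against round-off
        m = V @ (Xty / s2)
        b = rng.multivariate_normal(m, V)
        # s2 | b, y ~ Inv-Gamma(a0 + n/2, b0 + ||y - X b||^2 / 2)
        resid = y - X @ b
        s2 = 1.0 / rng.gamma(a0 + n / 2.0, 1.0 / (b0 + resid @ resid / 2.0))
        draws_b.append(b)
        draws_s2.append(s2)
    return np.array(draws_b), np.array(draws_s2)

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 3))
beta_true = np.array([1.5, -2.0, 0.0])
y = X @ beta_true + rng.standard_normal(200)
bs, s2s = gibbs_linreg(X, y)
post_mean = bs[500:].mean(axis=0)   # posterior mean after discarding burn-in
```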


Subject(s)
Genome, Genomics, Bayes Theorem, Computational Biology, Humans, Latent Class Analysis, Software
11.
J Mach Learn Res ; 21(79): 1-47, 2020.
Article in English | MEDLINE | ID: mdl-34305477

ABSTRACT

Graphical models are ubiquitous tools to describe the interdependence between variables measured simultaneously, such as large-scale gene or protein expression data. Gaussian graphical models (GGMs) are well-established tools for probabilistic exploration of dependence structures using precision matrices, and they are generated under a multivariate normal joint distribution. However, they suffer from several shortcomings because they rest on Gaussian distributional assumptions. In this article, we propose a Bayesian quantile-based approach for sparse estimation of graphs. We demonstrate that the resulting graph estimation is robust to outliers and applicable under general distributional assumptions. Furthermore, we develop efficient variational Bayes approximations to scale the methods for large data sets. Our methods are applied to a novel cancer proteomics dataset in which multiple proteomic antibodies are simultaneously assessed on tumor samples using reverse-phase protein array (RPPA) technology.

12.
Biometrics ; 76(1): 316-325, 2020 03.
Article in English | MEDLINE | ID: mdl-31393003

ABSTRACT

Accurate prognostic prediction using molecular information is a challenging area of research, which is essential to the development of precision medicine. In this paper, we develop translational models to identify major actionable proteins that are associated with clinical outcomes, such as the survival time of patients. There are considerable statistical and computational challenges due to the large dimension of the problems. Furthermore, data are available for different tumor types; hence data integration across tumors is desirable. Having censored survival outcomes adds one more level of complexity to the inferential procedure. We develop Bayesian hierarchical survival models that accommodate all the challenges mentioned here. We use the hierarchical Bayesian accelerated failure time model for survival regression. Furthermore, we assume a sparse horseshoe prior distribution for the regression coefficients to identify the major proteomic drivers. We borrow strength across tumor groups by introducing a correlation structure among the prior distributions. The proposed methods have been used to analyze data from the recently curated "The Cancer Proteome Atlas" (TCPA), which contains reverse-phase protein array-based high-quality protein expression data as well as detailed clinical annotation, including survival times. Our simulation and the TCPA data analysis illustrate the efficacy of the proposed integrative model, which links different tumors with the correlated prior structures.
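The accelerated failure time likelihood with right censoring, which underlies models like the one above, combines density contributions for observed events with survival-function contributions for censored times. Below is a minimal log-normal AFT log-likelihood sketch on simulated data; it omits the horseshoe prior and the hierarchy, showing only the censored likelihood such a sampler would target:

```python
import numpy as np
from math import erf, log, pi, sqrt

def norm_logpdf(z):
    return -0.5 * z * z - 0.5 * log(2.0 * pi)

def norm_logsf(z):
    # log of the standard normal survival function, floored to avoid log(0)
    return log(max(0.5 * (1.0 - erf(z / sqrt(2.0))), 1e-300))

def aft_loglik(beta, sigma, t, delta, X):
    # log-normal AFT model: log t = X beta + sigma * e, e ~ N(0, 1);
    # delta[i] = 1 for an observed event, 0 for a right-censored time
    z = (np.log(t) - X @ beta) / sigma
    ll = 0.0
    for zi, di, ti in zip(z, delta, t):
        if di:      # event: log-density of t (Jacobian term 1/(sigma * t))
            ll += norm_logpdf(zi) - log(sigma) - log(ti)
        else:       # censored: probability of surviving beyond t
            ll += norm_logsf(zi)
    return ll

rng = np.random.default_rng(4)
n, p = 100, 2
X = rng.standard_normal((n, p))
beta_true = np.array([0.8, -0.5])
t_event = np.exp(X @ beta_true + 0.5 * rng.standard_normal(n))
c = np.exp(rng.normal(0.5, 1.0, size=n))         # random censoring times
t = np.minimum(t_event, c)
delta = (t_event <= c).astype(int)
ll = aft_loglik(beta_true, 0.5, t, delta, X)
```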


Subject(s)
Biometry/methods, Neoplasms/metabolism, Neoplasms/mortality, Proteome/metabolism, Proteomics/statistics & numerical data, Bayes Theorem, Computer Simulation, Statistical Data Interpretation, Humans, Kidney Neoplasms/metabolism, Kidney Neoplasms/mortality, Markov Chains, Statistical Models, Monte Carlo Method, Prognosis, Protein Array Analysis/statistics & numerical data, Survival Analysis
13.
Cancer Inform ; 18: 1176935119871933, 2019.
Article in English | MEDLINE | ID: mdl-31488946

ABSTRACT

Long non-coding RNAs (lncRNAs) are a large and diverse class of transcribed RNAs, which have been shown to play a significant role in cancer development. In this study, we apply an integrative modeling framework to integrate DNA copy number variation (CNV), lncRNA expression, and downstream target protein expression to predict patient survival in breast cancer. We develop a 3-stage model combining a mechanistic model (lncRNA regressed on CNV, and target proteins regressed on lncRNA) and a clinical model (survival regressed on estimated effects from the mechanistic models). Using lncRNAs (such as HOTAIR and MALAT1) along with their CNV, target protein expressions, and survival outcomes from The Cancer Genome Atlas (TCGA) database, we show that the predicted mean square error and the integrated Brier score (IBS) are both lower for the proposed 3-stage integrated model than for the 2-stage model. Therefore, the integrative model has better predictive ability than the 2-stage model, which does not consider target protein information.

14.
J R Stat Soc Ser C Appl Stat ; 68(5): 1577-1595, 2019 Nov.
Article in English | MEDLINE | ID: mdl-33311813

ABSTRACT

We consider the problem where the data consist of a survival time and a binary outcome measurement for each individual, as well as corresponding predictors. The goal is to select the common set of predictors that affect both responses, not just one of them. In addition, we develop a survival prediction model based on data integration. This article is motivated by The Cancer Genome Atlas (TCGA) databank, which is currently the largest genomics and transcriptomics database. The data contain cancer survival information along with cancer stages for each patient. Furthermore, it contains reverse-phase protein array (RPPA) measurements for each individual, which are the predictors associated with these responses. The biological motivation is to identify the major actionable proteins associated with both survival outcomes and cancer stages. We develop a Bayesian hierarchical model to jointly model the survival time and the classification of the cancer stages. Moreover, to deal with the high dimensionality of the RPPA measurements, we use a shrinkage prior to identify significant proteins. Simulations and TCGA data analysis show that the joint integrated modeling approach improves survival prediction.

15.
Bayesian Anal ; 14(2): 449-476, 2019 Jun.
Article in English | MEDLINE | ID: mdl-33123305

ABSTRACT

There has been intense development in the Bayesian graphical model literature over the past decade; however, most of the existing methods are restricted to moderate dimensions. We propose a novel graphical model selection approach for large-dimensional settings where the dimension increases with the sample size, by decoupling model fitting and covariance selection. First, a full model based on a complete graph is fit under a novel class of mixtures of inverse-Wishart priors, which induce shrinkage on the precision matrix under an equivalence with Cholesky-based regularization, while enabling conjugate updates. Subsequently, a post-fitting model selection step uses penalized joint credible regions to perform model selection. This allows our methods to be computationally feasible for large-dimensional settings using a combination of straightforward Gibbs samplers and efficient post-fitting inferences. Theoretical guarantees in terms of selection consistency are also established. Simulations show that the proposed approach compares favorably with competing methods, both in terms of accuracy metrics and computation times. We apply this approach to a cancer genomics data example.

16.
J Classif ; 35(1): 29-51, 2018 Apr.
Article in English | MEDLINE | ID: mdl-30287977

ABSTRACT

This paper discusses the challenges presented by tall-data problems associated with Bayesian classification (specifically binary classification) and the existing methods to handle them. Current methods include parallelizing the likelihood, subsampling, and consensus Monte Carlo. A new method based on the two-stage Metropolis-Hastings algorithm is also proposed. The purpose of this algorithm is to reduce the exact likelihood computational cost in the tall-data situation. In the first stage, a new proposal is tested by the approximate-likelihood-based model. The full-likelihood-based posterior computation is conducted only if the proposal passes the first-stage screening. Furthermore, this method can be adopted into the consensus Monte Carlo framework. The two-stage method is applied to logistic regression, hierarchical logistic regression, and Bayesian multivariate adaptive regression splines.
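The two-stage screening idea can be sketched for logistic regression: a cheap subsample-based approximation filters proposals, and only survivors pay for a full-data likelihood evaluation, with a second-stage correction (in the style of delayed-acceptance Metropolis-Hastings) that keeps the chain targeting the full posterior. This is a schematic reconstruction under a symmetric random-walk proposal and a flat prior, not the paper's exact algorithm:

```python
import numpy as np

def log_post_full(theta, X, y):
    # full-data logistic log-posterior (flat prior on theta)
    eta = X @ theta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

def log_post_approx(theta, Xs, ys, scale):
    # subsample approximation, scaled up to the full data size
    eta = Xs @ theta
    return float(scale * np.sum(ys * eta - np.logaddexp(0.0, eta)))

def two_stage_mh(X, y, n_iter=500, sub=100, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    idx = rng.choice(n, size=sub, replace=False)
    Xs, ys, scale = X[idx], y[idx], n / sub
    theta = np.zeros(p)
    lp_full = log_post_full(theta, X, y)
    lp_apx = log_post_approx(theta, Xs, ys, scale)
    n_full_evals = 0
    for _ in range(n_iter):
        prop = theta + step * rng.standard_normal(p)
        lp_apx_prop = log_post_approx(prop, Xs, ys, scale)
        # stage 1: cheap screen using the approximate posterior only
        if np.log(rng.uniform()) < lp_apx_prop - lp_apx:
            n_full_evals += 1          # full likelihood paid only here
            lp_full_prop = log_post_full(prop, X, y)
            # stage 2: correction factor keeps the full posterior invariant
            if np.log(rng.uniform()) < (lp_full_prop - lp_full) - (lp_apx_prop - lp_apx):
                theta, lp_full, lp_apx = prop, lp_full_prop, lp_apx_prop
    return theta, n_full_evals

rng = np.random.default_rng(5)
X = rng.standard_normal((5000, 2))
theta_true = np.array([1.0, -1.0])
y = (rng.uniform(size=5000) < 1.0 / (1.0 + np.exp(-(X @ theta_true)))).astype(float)
theta_hat, n_full = two_stage_mh(X, y)
```

The count `n_full` of full-likelihood evaluations is at most the number of iterations and is typically much smaller, which is where the savings come from.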

17.
J Am Stat Assoc ; 113(521): 401-416, 2018.
Article in English | MEDLINE | ID: mdl-30078920

ABSTRACT

We consider the problem of multivariate density deconvolution where interest lies in estimating the distribution of a vector-valued random variable X, but precise measurements of X are not available, observations being contaminated by measurement errors U. The existing sparse literature on the problem assumes the density of the measurement errors to be completely known. We propose robust Bayesian semiparametric multivariate deconvolution approaches for when the measurement error density of U is not known but replicated proxies are available for at least some individuals. Additionally, we allow the variability of U to depend on the associated unobserved values of X through unknown relationships, which also automatically includes the case of multivariate multiplicative measurement errors. Basic properties of finite mixture models, multivariate normal kernels, and exchangeable priors are exploited in novel ways to meet modeling and computational challenges. Theoretical results showing the flexibility of the proposed methods in capturing a wide variety of data-generating processes are provided. We illustrate the efficiency of the proposed methods in recovering the density of X through simulation experiments. The methodology is applied to estimate the joint consumption pattern of different dietary components from contaminated 24-hour recalls. The Supplementary Material presents substantive additional details.

18.
PLoS One ; 13(7): e0195070, 2018.
Article in English | MEDLINE | ID: mdl-30059495

ABSTRACT

Significant advances in biotechnology have allowed for the simultaneous measurement of molecular data across multiple genomic, epigenomic, and transcriptomic levels from a single tumor/patient sample. This has motivated systematic data-driven approaches to integrate multi-dimensional structured datasets, since cancer development and progression are driven by numerous coordinated molecular alterations and the interactions between them. We propose a novel multi-scale Bayesian approach that combines integrative graphical structure learning from multiple sources of data with a variable selection framework to determine the key genomic drivers of cancer progression. The integrative structure learning is first accomplished through novel joint graphical models for heterogeneous (mixed-scale) data, allowing for flexible and interpretable incorporation of prior existing knowledge. This subsequently informs a variable selection step to identify groups of coordinated molecular features within and across platforms associated with clinical outcomes of cancer progression, while according appropriate adjustments for multicollinearity and multiplicities. We evaluate our methods through rigorous simulations to establish superiority over existing methods that do not take the network and/or prior information into account. Our methods are motivated by and applied to a glioblastoma multiforme (GBM) dataset from The Cancer Genome Atlas to predict patient survival times by integrating gene expression, copy number, and methylation data. We find a high concordance between our selected prognostic gene network modules and known associations with GBM. In addition, our model discovers several novel cross-platform network interactions (both cis- and trans-acting) between gene expression, copy number variation associated gene dosing, and epigenetic regulation through promoter methylation, some with known implications in the etiology of GBM. Our framework provides a useful tool for biomedical researchers, since clinical prediction using multi-platform genomic information is an important step towards personalized treatment of many cancers.


Subject(s)
Brain Neoplasms/diagnosis, Epigenesis, Genetic, Gene Expression Regulation, Neoplastic, Glioblastoma/diagnosis, Neoplasm Proteins/genetics, Transcriptome, Atlases as Topic, Bayes Theorem, Brain Neoplasms/genetics, Brain Neoplasms/mortality, Brain Neoplasms/pathology, Computer Graphics, DNA Copy Number Variations, DNA Methylation, Datasets as Topic, Gene Dosage, Gene Regulatory Networks, Genomics/methods, Glioblastoma/genetics, Glioblastoma/mortality, Glioblastoma/pathology, Humans, Neoplasm Proteins/metabolism, Precision Medicine, Prognosis, RNA, Messenger/genetics, RNA, Messenger/metabolism, RNA, Neoplasm/genetics, RNA, Neoplasm/metabolism, Survival Analysis
19.
J Am Stat Assoc ; 113(524): 1733-1741, 2018.
Article in English | MEDLINE | ID: mdl-30739967

ABSTRACT

We develop a Bayes factor based testing procedure for comparing two population means in high-dimensional settings. In 'large-p-small-n' settings, Bayes factors based on proper priors require eliciting a large and complex p×p covariance matrix, whereas Bayes factors based on the Jeffreys prior suffer the same impediment as the classical Hotelling T² test statistic, as they involve inversion of ill-formed sample covariance matrices. To circumvent this limitation, we propose that the Bayes factor be based on lower-dimensional random projections of the high-dimensional data vectors. We choose the prior under the alternative to maximize the power of the test for a fixed threshold level, yielding a restricted most powerful Bayesian test (RMPBT). The final test statistic is based on the ensemble of Bayes factors corresponding to multiple replications of randomly projected data. We show that the test is unbiased and, under mild conditions, is also locally consistent. We demonstrate the efficacy of the approach through simulated and real data examples.
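The random-projection idea can be sketched as follows: project both samples to k dimensions with a random matrix (k small enough that the projected pooled covariance is invertible), compute a Hotelling-type statistic there, and average over replications. This is an illustrative reconstruction; the actual RMPBT uses an ensemble of Bayes factors with a power-maximizing prior rather than the raw T² statistic used here:

```python
import numpy as np

def projected_t2(X1, X2, k, n_rep=20, seed=0):
    # average a Hotelling-type statistic over random k-dim projections
    rng = np.random.default_rng(seed)
    n1, p = X1.shape
    n2 = X2.shape[0]
    stats = []
    for _ in range(n_rep):
        R = rng.standard_normal((p, k)) / np.sqrt(p)   # random projection matrix
        Y1, Y2 = X1 @ R, X2 @ R
        d = Y1.mean(axis=0) - Y2.mean(axis=0)
        # pooled covariance of the projected data (k x k, invertible since k < n)
        S = (np.cov(Y1.T) * (n1 - 1) + np.cov(Y2.T) * (n2 - 1)) / (n1 + n2 - 2)
        stats.append((n1 * n2 / (n1 + n2)) * d @ np.linalg.solve(S, d))
    return float(np.mean(stats))

rng = np.random.default_rng(6)
p, n = 500, 40                                   # p >> n: classical T^2 is not computable
X1 = rng.standard_normal((n, p))
X2 = rng.standard_normal((n, p)) + 0.5           # mean shifted in every coordinate
stat_alt = projected_t2(X1, X2, k=5)
stat_null = projected_t2(X1, rng.standard_normal((n, p)), k=5)
```

Averaging over replications of the projection reduces the variance contributed by any single unlucky projection direction.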

20.
Biometrika ; 103(4): 985-991, 2016 12.
Article in English | MEDLINE | ID: mdl-28435166

ABSTRACT

We propose an efficient way to sample from a class of structured multivariate Gaussian distributions. The proposed algorithm only requires matrix multiplications and linear system solutions. Its computational complexity grows linearly with the dimension, unlike existing algorithms that rely on Cholesky factorizations with cubic complexity. The algorithm is broadly applicable in settings where Gaussian scale mixture priors are used on high-dimensional parameters. Its effectiveness is illustrated through a high-dimensional regression problem with a horseshoe prior on the regression coefficients. Other potential applications are outlined.
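A sketch consistent with the abstract's description: under Gaussian scale-mixture priors the target is N(mu, Sigma) with Sigma = (Phi'Phi + D^-1)^-1 and mu = Sigma Phi' alpha, and a sample can be drawn by solving only an n×n linear system, so the cost is linear in p for fixed n. The function names and the Woodbury sanity check below are illustrative, not the paper's code:

```python
import numpy as np

def fast_gaussian_sample(Phi, D_diag, alpha, rng):
    # draw theta ~ N(mu, Sigma) with Sigma = (Phi'Phi + D^-1)^-1 and
    # mu = Sigma Phi' alpha, solving only an n x n system
    n, p = Phi.shape
    u = np.sqrt(D_diag) * rng.standard_normal(p)   # u ~ N(0, D)
    v = Phi @ u + rng.standard_normal(n)           # v ~ N(0, Phi D Phi' + I)
    M = (Phi * D_diag) @ Phi.T + np.eye(n)
    w = np.linalg.solve(M, alpha - v)
    return u + D_diag * (Phi.T @ w)

rng = np.random.default_rng(7)
n, p = 20, 300                                     # p >> n regime
Phi = rng.standard_normal((n, p))
D_diag = rng.uniform(0.1, 1.0, size=p)             # e.g. horseshoe local-global scales
alpha = rng.standard_normal(n)
theta = fast_gaussian_sample(Phi, D_diag, alpha, rng)

# Woodbury sanity check: the implied mean D Phi'(Phi D Phi' + I)^-1 alpha
# must agree with the direct p x p solve (Phi'Phi + D^-1)^-1 Phi' alpha
mu_fast = D_diag * (Phi.T @ np.linalg.solve((Phi * D_diag) @ Phi.T + np.eye(n), alpha))
mu_direct = np.linalg.solve(Phi.T @ Phi + np.diag(1.0 / D_diag), Phi.T @ alpha)
```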
